Oh, and welcome to today's video nugget on grammars for natural language processing.
I'm sure that at least the computer scientists among you know about grammars, because
grammars were originally developed for handling formal languages.
They're tools for describing formal languages succinctly.
Now I would like to see whether grammars or other symbolic methods might actually
be useful for our problem of natural language processing.
Remember, for instance, when you learned English in school: one of the things you
were taught were grammar rules, like that English uses subject-predicate-object order,
or when to use the progressive past, and so on.
So grammars are, on the one hand, things we use for natural languages and, on the other,
for formal languages, and we want to reconcile the two and see how we can use grammars in language-based AI.
So I would like to motivate this from the point of view of language models.
We remember that while character-based language models work well, word-based language models
have a problem: we simply don't have enough data to estimate these huge models.
We've seen a couple of issues like unknown words and out-of-vocabulary words and so on.
But what we would like to do here is essentially use symbolic or symbolic-statistical methods.
The general idea is that we cluster words into what we call syntactic classes, and rather
than having a model of acceptable word sequences, which is what a word language model is,
what we try to write down is a model of acceptable word-class sequences. These
descriptions of acceptable word-class sequences are what we call phrase structure grammars.
The advantage of this is that, since we cluster words into classes first, we
can get by with much less information.
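As a tiny illustration of this clustering idea, here is a sketch in Python; the particular word classes and the single acceptable class sequence are illustrative assumptions of mine, not something from the lecture.

```python
# Map each word to its syntactic class, then judge a sentence by its
# word-class sequence rather than by the word sequence itself.
word_class = {
    'the': 'Art', 'a': 'Art',
    'dog': 'N', 'cat': 'N',
    'sleeps': 'Vi', 'barks': 'Vi',
}

# A (toy) description of acceptable word-class sequences.
acceptable_class_sequences = {
    ('Art', 'N', 'Vi'),   # e.g. "the dog sleeps"
}

def is_acceptable(sentence: str) -> bool:
    """Accept a sentence iff its word-class sequence is an acceptable one."""
    classes = tuple(word_class[w] for w in sentence.split())
    return classes in acceptable_class_sequences

print(is_acceptable('the cat barks'))  # True:  Art N Vi
print(is_acceptable('cat the barks'))  # False: N Art Vi
```

The point is that one class-sequence fact ("Art N Vi is acceptable") covers many concrete word sequences at once.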
So we can build a language model for, say, German with something like 10 to the
power of three structural rules over a lexicon of a hundred thousand words, and these generate
most acceptable German sentences.
And this generative capacity of grammars is something we're interested in: it gives
our models relatively good generalizability and it condenses the information.
We can get by with roughly 10 to the 5 facts, if you will, rather than the 10 to the
15 facts we estimated we would need for a German word trigram model.
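To see where these orders of magnitude come from, here is the rough arithmetic, sketched with the lecture's own figures (a lexicon of about 10^5 words and about 10^3 structural rules):

```latex
% Word trigram model: one parameter per possible trigram over the lexicon V.
\[
  |V|^{3} = \bigl(10^{5}\bigr)^{3} = 10^{15} \ \text{facts}
\]
% Phrase structure grammar: structural rules plus one lexicon entry per word.
\[
  10^{3} + 10^{5} \approx 10^{5} \ \text{facts}
\]
```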
So that's what we're after here in this section.
Many animals, lower animals below the primates, and certain birds actually use
isolated symbols as sentences, and with these they can communicate propositions:
marmosets, for example, can signal certain threats, usually from the air, with different calls, but they
don't have sentence structure.
Grammars are great for dealing with information sparsity, but they have the disadvantage, of course, that
they over- or under-generalize, as any symbolic model does.
And we're building on work by Noam Chomsky, who first used grammars for natural language.
Okay, so let's go into the theory.
I'm assuming many of you have seen this before.
So we define a phrase structure grammar to be a quadruple that has, first of all, a finite
set of non-terminal symbols; in phrase structure grammars we also call these syntactic categories.
In this little grammar here we have a syntactic category for sentences,
for noun phrases, for articles, for nouns, and for intransitive verbs.
And then we have a finite set of production rules, which are basically rewrite rules:
the head is rewritten to the body, where the head is essentially made up, in a
certain way, of a sequence of terminal and non-terminal symbols.
Oh, and I forgot the terminal symbols: those are the words of the lexicon themselves.
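To make this concrete, here is a minimal sketch of such a little grammar in Python using NLTK's CFG class; the particular productions and the toy lexicon entries are my own illustrative choices, in the spirit of the categories mentioned above (sentence, noun phrase, article, noun, intransitive verb).

```python
# A minimal sketch of a little phrase structure grammar, expressed with
# NLTK's CFG class. The concrete lexicon ('the', 'dog', 'sleeps', ...) is
# an illustrative assumption, not taken from the lecture.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP Vi
    NP -> Art N
    Art -> 'the' | 'a'
    N -> 'dog' | 'cat'
    Vi -> 'sleeps' | 'barks'
""")

# Parsing checks whether a word sequence is an acceptable word-class
# sequence under these rules; if so, we get back at least one parse tree.
parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'dog', 'sleeps']):
    print(tree)
```

Running this parses "the dog sleeps" into a single tree, (S (NP (Art the) (N dog)) (Vi sleeps)), which is exactly the sense in which the grammar describes acceptable word-class sequences.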